Abstract — This paper describes and provides an initial solution to a novel video editing task, i.e., video de-fencing. It targets automatic 
restoration of the video clips that are corrupted by fence-like occlusions during capture. Our key observation lies in the visual parallax 
between fences and background scenes, which is caused by the fact that the former are typically closer to the camera. Unlike in 
traditional image inpainting, fence-occluded pixels in the videos tend to appear later in the temporal dimension and are therefore 
recoverable via optimized pixel selection from relevant frames. To eventually produce fence-free videos, major challenges include 
cross-frame sub-pixel image alignment under diverse scene depth, and "correct" pixel selection that is robust to dominating fence 
pixels. Several novel tools are developed in this paper, including soft fence detection, a weighted truncated optical flow method, and a 
robust temporal median filter. The proposed algorithm is validated on several real-world video clips with fences. 
1 Introduction 

This paper describes a novel algorithm for automatic 
fence detection and removal in consumer video clips. We 
term this task "video de-fencing". The technique is 
useful in a variety of scenarios. When the target scene 
is occluded by a fence that the user is not allowed to 
cross (e.g., using a video camera to capture a tiger 
confined in a cage at the zoo), a natural solution is to 
record the scene through the fence and resort to 
post-processing algorithms for fence removal. As shown later 
in the paper, the notion of "fence" can be largely 
generalized, which makes the proposed technique general 
enough for everyday use of digital cameras. 
With the video de-fencing technique, users are able to 
obtain a fence-free video with little additional effort. 
Moreover, recent trends in digital cameras have shown 
the power of incorporating sophisticated algorithms into 
the camera hardware (e.g., high dynamic range photos). 
Once the technique of video de-fencing becomes more 
mature, it can be integrated into consumer camera 
hardware to achieve real-time video retouching. 

For scenes occluded by fences, the goal of video 
de-fencing is to automatically restore them and return 
fence-free videos. A vast amount of research has been 
devoted to the detection and removal of patterns (e.g., 
near-regular structures, rain [1], [2], snowflakes [3]) in 
both images and videos. Nonetheless, the video 
de-fencing problem is novel since fence-like structures have 
seldom been explored in video editing. The most relevant 
work to ours is the "image de-fencing" by Liu et 
al. [4], where the authors propose to solve image 
de-fencing in two steps. First, the fences are detected 
according to spatial regularity (e.g., symmetry) and image 
masks are subsequently constructed. Afterwards, 
sophisticated image inpainting techniques [5] are used for 
fence filling. The major limitation of the method stems 
from the assumption that the occluded scenes have 
repeating texture. The restoration of missing information 
within a single image is not a well-defined problem for 
general images, since in most cases the repeating-texture 
assumption fails to hold. In the follow-up work [6], 
the authors further attempt to overcome the aforementioned 
limitation by using multi-view images. Given 
an image I, the method computes the optical flow (based on 
the Lucas-Kanade algorithm [7]) to another image from a 
related view. The flow field is then used to help find patch 
correspondences across multi-view images. In this way the 
scarcity of source information for fence filling is partially 
mitigated. The method can generate plausible 
results on some images. However, it 
does not capitalize on the temporal information contained 
in videos, and in particular does not explicitly address 
the issue of visual parallax, which fundamentally differs 
from our proposed problem setting and corresponding 
solutions. Another straightforward solution for video 
de-fencing is to manually mask out the fences and perform 
video completion [8], [9]. However, mask generation is 
known to be labor-intensive, especially for 
web-like, thin fences. 

Our research is inspired by the significant advances 
in the field of computational photography¹. By varying 
specific camera parameters (e.g., flash, view points, shutter, 
depth of field) within a small range during image or 
video capture, various difficult tasks can be simplified. 
An example is to solve image denoising and detail 
transfer under low-lighting conditions by combining the 
strengths of flash/no-flash image pairs [10]. For video 
de-fencing, the visual parallax [11] is observed when 
the depth of field is large, which provides a useful cue 
for video analysis if the motion path of the camera follows 
specific patterns (e.g., roughly parallel to the fences and 
scenes), as illustrated in Fig. 2. It is shown that fence 
pixels (i.e., pixels from the image regions that correspond 
to the fence) show a stronger drifting tendency compared 
with scene pixels (i.e., pixels of the image except for fence 
pixels), which is known as visual parallax and serves as 
the basic cue to distinguish these two kinds of pixels. 
Moreover, under a large variety of real-world scenarios 
the camera moves in such a way that, with high 
probability, each pixel from scene objects 
is occluded in only some of the frames and becomes 
visible in others, enabling occlusion removal via pixel 
selection from relevant frames. 

1. Visit http://en.wikipedia.org/wiki/Computational_photography 
for a quick reference. 

Fig. 1. Illustration of the algorithmic pipeline for video de-fencing. It shows the operations on a frame selected from the fence-corrupted video clip 
"Flower". All the operations (including fence detection and removal) are fully automatic. 

Our main contribution is the exposition of an inte- 
grated video de-fencing algorithm, which consists of 
three successive steps: 1) estimating probability of fence 
(PoF) for each pixel, 2) parallax-aware sub-pixel image 
alignment via the proposed weighted truncated optical flow 
method, and 3) robust temporal median filter towards pixel 
restoration based on low-rank subspace optimization 
theory. Fig. 1 illustrates the algorithmic pipeline of our 
proposed method. 

As an initial study of an emerging topic, we focus on 
the case of static scenes. The problem setting, especially 
the hypotheses about the scenes and camera motions, 
is addressed in detail in Section 3. We target a fully 
automated method for solving this problem, unlike those 
works that keep users in the loop [12], [6]. Extensions of 
the proposed method to other challenging settings (e.g., 
dynamically-moving objects in the scene) are of great 
importance for practical consumer videos. We leave them 
as future work but discuss possible solutions 
(e.g., motion layer analysis) in Section 7. 

2 Related Works 

2.1 Video Editing and Composition 

A vast literature has been devoted to video editing and 
composition in the past decade, including semantic object 
cutout [13], motion magnification [14], video stabilization 
[15], video matching [16] and video completion [8]. 
An interesting topic in this field is the removal of 
characteristic structures, such as rain or snowflakes. For 
example, Garg et al. [1] treated the visual manifestations 
of rain as a combination of both the dynamics of rain 
and the photometry of the environment. A correlation 
model that captures the dynamics of rain and a physical 
motion model that explains the photometry of rain are 
coupled for detecting and removing rain from videos. 
The work in [4] investigated semi-automatic fence detection 
and removal for images. However, the de-fencing 
quality therein is heavily dependent on the availability 
of repeating textures. The work in [6] later extends 
the method to multi-view images, and is the most 
relevant work to the method proposed in this paper. 
However, the multi-view inputs therein are only loosely 
correlated and no strict temporal consistency is enforced. 
Heterogeneous views of the same scene are used to 
find matching patches according to the SSD (sum of 
squared differences) score between local patches, rather than 
from temporal alignment and parallax cues. In this sense, 
Park et al. [6] do not provide an integrated framework 
that utilizes visual parallax for video de-fencing. 
Fig. 2. Illustration of the underlying mechanism of video de-fencing. The left figure depicts the relative positions between the scene (containing 
two objects denoted in green and yellow respectively), fence (denoted in red) and camera, together with the camera motion path at time T and 
T + 1 respectively. The right figure intuitively explains the parallax phenomena caused by diverse scene depth, whose right panel further shows 
the situation after parallax-aware frame alignment. Namely, the frame alignment is accomplished in such a way that scene objects are well aligned 
across multiple frames. By the principle of parallax, the fences will fail to be aligned, which enables fence detection and removal. 



Major difficulties in this field stem from challenging 
problems like depth estimation and point correspondence. 
In image stitching and panorama generation 
from multiple images or videos, correspondence can 
be obtained using invariant features [17]. However, as 
stated in [18], parallax caused by depth significantly 
degrades the final quality yet has not been adequately 
solved. Various methods have been proposed to estimate 
depth [19], [20]. 

2.2 Robust Data Recovery 

Prior work on video completion [8], [9] treats video as a 
spatio-temporal cube. Various diffusion procedures are 
utilized to fill the missing parts in the videos. As will 
be shown later, our proposed method adopts a different 
idea. It hinges on non-parametric parallax-aware 
frame alignment, which also differs from methods based 
on geometric reconstruction [21]. Pixel restoration 
on aligned frames boils down to robust data recovery 
in the presence of arbitrarily-corrupted outliers. The median 
filter is widely adopted for this task owing to its empirical 
success. Many variants have also been proposed, e.g., 
the weighted median filter (WMF) [22]. 

The median filter operates on scalars. For images or videos, 
spatial smoothness or temporal coherence provides 
contextual regularization to mitigate the adverse effect of 
outliers. Since a high-dimensional representation (vector, 
matrix, or tensor) is more natural for visual data, matrix-based 
robust recovery has recently attracted increasing 
attention. Recent work on visual recovery borrows 
tools from sparse learning [23] and optimization 
theory. A representative work can be found in [24], 
wherein several exemplar applications are presented, such as 
face recovery and surveillance background modeling. 

3 Overview 

Fig. 3. The desired camera motion is related to the geometry of 
the fences. This example illustrates the effect of two kinds of camera 
motions (i.e., vertical and horizontal motions respectively) on the "T"-shaped 
fence. 

Our basic observation is that pixels occluded by the 
fences in a frame tend to become un-occluded as 
the camera moves. In other words, suppose multiple 
consecutive frames are carefully aligned in a pixel-by-pixel 
manner, ensuring that in most cases the identically 
coordinated pixels along the temporal axis correspond to 
the same semantic objects, except for the occluded pixels. 
The occlusion by fences can then be eliminated 
via pixel substitution from the relevant frames which are 
unobstructed from the corresponding viewpoint. Fig. 2 
illustrates the principal mechanism for solving the video 
de-fencing problem, highlighting the parallax resulting 
from disparate scene depths. As in prior exposition, 
we make reasonable assumptions both to obtain 
a solvable problem and to cover a wide range of 
consumer videography. Specifically, the most important 
assumptions are listed below: 

• Fence-like occlusions are overwhelmingly closer to 
the camera compared with the target scene. Note 
that here the term "fence" generally refers to anything 
excluded from the target scene, like an object 
right behind or in front of the real fence. It is 
also expected that the fence has a long, thin shape. 
As will be shown in the experiments, such shapes 
benefit more robust and exact video restoration. 
• The moving path of the camera is approximately 
parallel to the planes of fences. The goal is to 
make two consecutive frames approximately related 
by an affine transform, avoiding intractable 
perspective distortion. Moreover, the ideal camera 
moving direction is heavily dependent on the geometry 
of the fences. An example is shown in Fig. 3. Due 
to the special "T"-shaped fence, neither vertical nor 
horizontal camera motion is able to generate the 
expected visual parallax. Only an in-between camera 
motion can capture all branches of the fence. Finally, 
the magnitude of camera motion should be large 
enough to guarantee the un-occlusion of every part 
of the scene in a number of consecutive frames. 
Fig. 4 shows an example, where it is observed that 
some pixels are impossible to restore under 
reasonable parameters. This is due to one of two reasons, 
either a thick fence (e.g., the tree in Fig. 4) or weak 
camera motion, both of which should be taken into 
account during video capture. 

Fig. 4. Illustration of the restoration of occluded pixels. The top row 
shows a consecutive video frame sequence. Here the goal is to remove 
the occluding tree from the scene. The green line denotes an arbitrary 
scan line in the first frame along the horizontal direction. Putting the first 
K frames (K = 7, 10, 13 respectively in this example) together with scene 
pixels aligned (see Section 4 for more technical details), it is possible to 
get image slices along the scan line. It is intuitively observed that pixel 
A on the scan line cannot be correctly restored since it is occluded for 
all K values, and pixel B can be restored only for K = 13. Best viewed 
in color. 

The above rules imply that the proposed method is not 
applicable to general scenes. However, the rules cover a 
large spectrum of short-duration consumer videos (typically 
lasting only several seconds), and a user can learn 
to capture the desired video clips after simple 
training. 

We term this task "video de-fencing", which further 
boils down to two sub-tasks, i.e., probability-of-fence (PoF) 
estimation and pixel selection. The former refers to the 
identification of those pixels undergoing fence occlusions, 
and the latter tries to restore the visual information 
of high PoF pixels from temporally neighboring 
frames. Due to the ambiguity in pixel correspondence 
and complex scene structure, both sub-tasks are known 
to be difficult. The following sections address these 
sub-problems respectively. 

4 Probability-of-Fence (PoF) Estimation 

Fig. 5. The left figure draws the result using the state-of-the-art fence 
detector [4]. Several fence grids fail to be detected. It also tends to fail 
on non-repeating fences. The right figure shows the image completion 
result (with manually-labeled fence mask) using the built-in function in 
Photoshop CS5. 



The goal of this stage is to estimate, for each pixel, the 
confidence that it comes from a fence. Before proceeding, 
we first introduce two existing solutions towards 
this goal: 

• In the prior study on "image de-fencing" [4], fences are 
assumed to have visual regularity such as structural 
symmetry. However, our empirical investigation reveals 
its inapplicability. The diversity of fence appearance 
(see Fig. 11) can hardly be encompassed 
using simple visual rules like symmetry or local 
low-rank texture [25]. Even the state-of-the-art fence 
detector needs to be improved to be more efficient and 
practical for real-world video sequences. Fig. 5 provides 
a failure example on the video clip "Tennis" 
(the left sub-figure). Fig. 11 further presents some 
video clips (e.g., "Winter Palace" and "Temple") that 
do not contain any symmetry, which complicates 
fence detection. 

• Another natural solution is to manually specify 
an image mask and resort to image completion 
algorithms [26], [27]. Fig. 5 presents the results 
obtained by the "content-aware fill" function implemented 
in the commercial software Photoshop CS5. 
This feature helps users retouch an image 
by removing unwanted areas and filling 
the space using the surrounding pixels. 
However, occlusion is generally unrecoverable 
from a single frame. 

Unlike previous work, we perform probability-of-fence 
(PoF) estimation based on visual parallax. Two types 
of cues are utilized to infer the PoF values, i.e., visual 
flow analysis and image appearance differencing, described 
in Sections 4.1 and 4.2 respectively. Incorporating 
the motion cue from visual flow analysis reflects the 
assumption that scene background and fence-like objects 
can be effectively distinguished by parallax. However, 
in some cases, due to the ambiguity in motion 
estimation, the motion cue can be noisy (see Fig. 6(a) for an 
example), which motivates us to also utilize the appearance cue. 
The basic idea is to first estimate the mode of the motions 
between two consecutive frames, and then perform 
frame alignment accordingly. After differencing the aligned 
frames, those pixels with large appearance variances 
are assigned high probabilities of being from fences, 
since their motion vectors probably deviate a lot from 
the motion mode. However, as shown in Fig. 6(c)(d), 
the appearance cue tends to generate undesired striped PoF 
patterns for uniform-colored fences (since the response 
is only strong at the edges of the fences), which indicates 
that these two kinds of cues are complementary 
to each other in many cases. 

Due to video noise and the inconsistency between 
the presumed motion models and real-world scene geometry, 
making a binary decision per pixel (i.e., coming 
from either fence or target scene) is error-prone. Instead, 
each pixel is assigned a soft confidence value in [0, 1], 
indicating the probability of being from the fence. The final 
PoF confidence values are obtained via a linear combination 
of these complementary information channels. 

4.1 Motion Cue 



Fig. 6. PoF estimations for the 20-th frame in video clip "Tennis" (see 
Fig. 5 for an exemplar frame from this video clip) and the 24-th frame in 
video clip "Football", shown in the top row and bottom row respectively. 
Panels: (a) motion-based PoF and (b) appearance-based PoF (top row); 
(c) motion-based PoF and (d) appearance-based PoF (bottom row). 
Note that the motion cue and appearance cue tend to be complementary 
to each other. See text for more explanation. 



Since the video clips are assumed to be continuously 
captured, each frame can be reasonably mapped to the 
coordinate system of its neighboring frames according to 
tiny motion flows (usually less than a 3-pixel displacement). 
However, the motion magnitude of fence pixels 
is notably larger due to the parallax phenomenon. 
Consequently, it is possible to identify the fence 
by detecting the saliently-shifted pixels. 

The optical flow method [28], [29] is employed 
towards the aforementioned aim. Suppose that from frame 
$F^t$ to frame $F^{t+1}$ the scene undergoes a motion field valued as 
$w(p) = (u, v)$ at the pixel with index $p$, where $u$ and 
$v$ denote the horizontal and vertical flow respectively. 
Under the Lambertian surface and brightness constancy 
assumptions, and the piecewise smoothness prior, we 
adopt the following objective function as in [29] to guide 
the motion field optimization, i.e., 

$$J(u, v) = \int_p \psi\left(|F(p + w) - F(p)|\right) + \alpha\,\phi\left(|\nabla u|^2 + |\nabla v|^2\right),$$

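where $\psi(\cdot)$ and $\phi(\cdot)$ are robust norms taking the forms 
$\psi(x) = \sqrt{x^2 + \epsilon^2}$ and $\phi(x) = \sqrt{x^2 + \epsilon^2}$ ($\epsilon$ is a small 
constant for numerical stability). $|\nabla u|^2 = u_x^2 + u_y^2$ penalizes 
large motion gradients ($u_x = \partial u/\partial x$, $u_y = \partial u/\partial y$). 
The high non-convexity of $J(u, v)$ makes the 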
optimization easily trapped in local optima. To address 
this issue, a Gaussian image pyramid is constructed. The 
optimization starts from low-resolution image 
levels and then propagates to finer levels by bilinear 
interpolation. Following the work in [29], we calculate 
the first-order Taylor expansion of $J(u, v)$ and adopt 
the iterative reweighted least squares (IRLS) scheme for 
updating, which demonstrates high efficiency (roughly 2 
seconds for a 480 × 240-pixel image). Fig. 6 shows an 
example. 

After obtaining the motion field, it is necessary to 
compute its principal direction and discard redundant 
information. The principal direction can be computed from the 
covariance matrix, i.e., 

$$C = E_p\left[(w_p - \bar{w})^\top (w_p - \bar{w})\right], \tag{1}$$

where $\bar{w}$ is the averaged motion vector. The principal 
direction (equivalent to the camera motion direction) is 
the eigenvector of matrix $C$ associated with 
the maximal eigenvalue. Denote it by $c$. All motion 
vectors are then projected onto $c$ to remove the redundant 
component², yielding the 1-D scalar $m(p) = c^\top w(p)$ for 
pixel $p$. The final confidence of fence-ness is calculated 
by choosing two thresholds $\theta_l$ and $\theta_h$ and performing a 
linear mapping as below: 

$$f_m(p) = \frac{\tilde{m}(p) - \theta_l}{\theta_h - \theta_l}, \tag{2}$$

where $\theta_h > \theta_l$ and $\tilde{m}(p) = \max(\min(m(p), \theta_h), \theta_l)$ 
for robustness. In our implementation, both 
thresholds are generated in a data-driven manner, i.e., $\theta_h$ and 
$\theta_l$ are chosen as the 90th and 10th percentiles of all 
the projected values, respectively. For frame $F^t$, there are two 
mapping directions (i.e., to frame $F^{t-1}$ or to frame $F^{t+1}$). We compute 
the confidence value in each direction and take the 
averaged value as the final result. See Fig. 6 for an 
example. 

2. Note that both $c$ and $-c$ equivalently play the role of principal 
eigenvector. We adopt the one which ensures $E(c^\top w) > 0$, forcing the 
fence pixels to have larger positive projection values. 
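The motion cue can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `motion_pof`, the `(H, W, 2)` flow layout and the percentile arguments are hypothetical, and the dense flow field is assumed to be already available from some optical flow solver.

```python
import numpy as np

def motion_pof(flow, lo_pct=10.0, hi_pct=90.0):
    """Map a dense flow field of shape (H, W, 2) to fence probabilities in [0, 1].

    Illustrative sketch only; names and array layout are assumptions.
    """
    w = flow.reshape(-1, 2).astype(float)
    centered = w - w.mean(axis=0)
    # 2x2 covariance of the motion vectors.
    C = centered.T @ centered / len(w)
    # eigh returns eigenvalues in ascending order, so the last column
    # is the principal direction.
    _, eigvecs = np.linalg.eigh(C)
    c = eigvecs[:, -1]
    # Resolve the sign ambiguity so that the mean projection is positive.
    if (w @ c).mean() < 0:
        c = -c
    m = w @ c
    # Data-driven thresholds taken as percentiles of the projections.
    theta_l, theta_h = np.percentile(m, [lo_pct, hi_pct])
    if theta_h <= theta_l:
        # Degenerate case (e.g., uniform motion): no parallax evidence.
        return np.zeros(flow.shape[:2])
    m_clamped = np.clip(m, theta_l, theta_h)
    pof = (m_clamped - theta_l) / (theta_h - theta_l)
    return pof.reshape(flow.shape[:2])
```

For a frame with both a forward and a backward flow field, the two resulting maps would simply be averaged.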

4.2 Appearance Cue 

The optical flow method is often inaccurate due to the 
aperture problem and corrupted pixels during capture. 
We empirically find that the appearance cue complements 
the above-mentioned motion cue in many cases (see Fig. 6 
for an example). Specifically, we assume that frame $F^t$ is 
mapped to frame $F^{t+1}$ by a parametric affine transform, 
i.e., the original coordinate $(x, y)$ is projected to the new 
one $(x', y')$ via $x' = a_1 x + a_2 y + a_3$, $y' = a_4 x + a_5 y + a_6$, 
where $a_1, \ldots, a_6$ are the coefficients to be optimized. 
At least three correspondences are required to estimate 
these six parameters. 

For this aim, we utilize the local keypoint based image 
alignment algorithm [18]. On each video frame, SIFT 
features are extracted and matched between consecutive 
frames. Since the coordinates of SIFT features in frames 
$F^t$ and $F^{t+1}$ are known, the six parameters can be 
estimated from the SIFT correspondences by least 
squares. Note that the fences and scene always undergo 
different affine transforms due to the visual parallax. 
To enhance robustness, we further modulate each 
keypoint by its motion-induced PoF value $f_m(p)$. SIFT 
features with high $f_m(p)$ values are assigned 
low weights (in practice we adopt $1 - f_m(p)$), resulting in a 
weighted least-squares estimator. 
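A minimal sketch of such a weighted least-squares affine fit is given below. It assumes the keypoint matches and their motion-based PoF values are already available; the function name `weighted_affine` and the array layout are hypothetical, and keypoint extraction and matching are outside the scope of the sketch.

```python
import numpy as np

def weighted_affine(src, dst, fence_prob):
    """Fit x' = a1*x + a2*y + a3, y' = a4*x + a5*y + a6 by weighted least squares.

    src, dst   : (N, 2) arrays of matched keypoint coordinates.
    fence_prob : (N,) motion-based fence probabilities at the keypoints.
    Returns a 2x3 matrix [[a1, a2, a3], [a4, a5, a6]].
    Illustrative sketch only; names and layout are assumptions.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    weights = 1.0 - np.asarray(fence_prob, dtype=float)
    # Design matrix with rows (x, y, 1).
    A = np.hstack([src, np.ones((len(src), 1))])
    # Scaling rows by sqrt(weight) turns ordinary least squares
    # into weighted least squares.
    sw = np.sqrt(weights)[:, None]
    params, *_ = np.linalg.lstsq(A * sw, dst * sw, rcond=None)
    return params.T
```

Keypoints that are almost certainly on the fence receive weights near zero and therefore barely influence the estimated scene transform.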

With the estimated affine transform, the temporally 
adjacent frames can be aligned to a specific 
frame, which enables the analysis of any pixel 
$p$ on this frame using the geometrically-aligned 
spatio-temporal cube. We characterize a pixel $p$ by 
analyzing a spatio-temporal patch around it (in practice 
we adopt the size 5 × 5 × 3). Various operators have 
been proposed to estimate the regularity of such 
spatio-temporal structures, e.g., the eigen-spectrum based 
motion estimator [30]. Intuitively, any aligned pixel tends 
to be from the target scene if it has small intensity variation 
along the temporal dimension (see the frame slices in 
Fig. 4). For numerical stability, we regularize it using 
the variation along the $(x, y)$ dimensions. For each pixel, 
its variations along the $(x, y)$ image plane and the temporal 
dimension are estimated and denoted as $\sigma_{xy}(p)$ and $\sigma_t(p)$ 
respectively. The appearance-induced fence-ness is defined 
as below: 

$$f_a(p) = \tanh\left(\frac{\sigma_t(p)}{\gamma + \sigma_{xy}(p)}\right) \in [0, 1], \tag{3}$$

where $\gamma = E_p(\sigma_{xy}(p))$ is introduced to penalize small 
$\sigma_{xy}(p)$ in uniform image regions, and $\tanh(\cdot)$ is a function 
that maps the input scalar to the range [0, 1]. 

4.3 Bundle Adjustment 

The final PoF value is computed by linearly combining 
$f_m(p)$ and $f_a(p)$, i.e., $f(p) = \lambda f_m(p) + (1 - \lambda) f_a(p)$. After 
the computation over all frames, bundle adjustment can 
be adopted for further noise suppression. Let $\mathcal{N}(p)$ be 
the index set of neighbors (either spatial or temporal) 
of pixel $p$. Its PoF value is updated according to 

$$f(p) \leftarrow \kappa\, f(p) + (1 - \kappa)\,\frac{1}{|\mathcal{N}(p)|}\sum_{q \in \mathcal{N}(p)} f(q), \tag{4}$$

where $\kappa$ is a free parameter controlling the resistance 
to incoming information. Given a well-defined 
neighborhood system, the above procedure is 
known to have a convergence guarantee [31]. 
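The fusion and smoothing of the two cues can be illustrated with the sketch below. Several details are assumptions made for illustration rather than the authors' exact formulas: the tanh argument is taken as the temporal variation divided by the regularized spatial variation, the neighborhood is the 4-connected spatial one with a plain average, and the values of `lam` and `kappa` are arbitrary placeholders.

```python
import numpy as np

def appearance_pof(sigma_t, sigma_xy):
    """Appearance-based fence-ness from temporal and spatial variation maps.

    Assumed form: tanh(sigma_t / (gamma + sigma_xy)), gamma = mean(sigma_xy).
    """
    gamma = sigma_xy.mean()
    # The tiny constant only guards against division by zero.
    return np.tanh(sigma_t / (gamma + sigma_xy + 1e-12))

def combine_and_smooth(f_m, f_a, lam=0.5, kappa=0.5, iters=10):
    """Linearly combine two PoF maps, then smooth with neighbor averaging.

    Assumed update: f <- kappa * f + (1 - kappa) * mean of 4 spatial neighbors.
    """
    f = lam * f_m + (1.0 - lam) * f_a
    for _ in range(iters):
        # Edge padding replicates border values so every pixel has 4 neighbors.
        padded = np.pad(f, 1, mode="edge")
        neighbors = (padded[:-2, 1:-1] + padded[2:, 1:-1]
                     + padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
        f = kappa * f + (1.0 - kappa) * neighbors
    return f
```

Because every update is a convex combination of values in [0, 1], the smoothed map stays in [0, 1].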

5 Pixel Restoration 

The next crucial step is to restore the 
occluded pixels (i.e., those associated with high PoF 
values). Unlike texture-based inpainting in image 
de-fencing, video de-fencing capitalizes on the temporal 
consistency of the pixels in the geometrically-aligned 
frame sequences. If the true color of a pixel has been 
exposed to the camera in at least some frames, it is 
theoretically recoverable. Two challenges remain 
towards the ultimate goal, i.e., sub-pixel frame alignment 
and subsequent robust temporal filtering, which are 
detailed in Sections 5.1 and 5.2 respectively. 

A naive solution is to first perform frame alignment 
using the global affine transform learned by the method 
in Section 4.2, followed by temporal median filtering to 
restore occluded pixels. However, this idea 
suffers from several problems on video clips captured 
by hand-held cameras. On one hand, the method in 
Section 4.2 cannot achieve sub-pixel image alignment under 
large occlusions and complicated scene depth structure. 
On the other hand, the naive temporal median filter (N-TMF) only 
works when the "correct" pixels dominate in quantity 
(see Fig. 4, where pixel B on slice-3 is theoretically 
recoverable yet the recovery will fail via N-TMF). 

To remedy these problems, Section 5.1 elaborates on an 
occlusion-resistant image alignment algorithm based on 
truncated optical flow computation, wherein both 
sub-pixel accuracy and robustness to small scene motions 
are feasible. Moreover, we also propose a robust temporal 
median filter (R-TMF) in Section 5.2, which can 
return the correct value even when fence pixels 
dominate the pixel collection in quantity. 

5.1 Parallax-aware sub-pixel frame alignment 

To ensure that every pixel is observed in at least some of the frames, the 
cardinality of the frame set used for pixel restoration is 
required to be large enough. For frame $F^t$, we choose the 
previous $M$ frames $\{F^{t-M}, F^{t-M+1}, \ldots, F^{t-1}\}$ together 
with the next $M$ frames $\{F^{t+1}, \ldots, F^{t+M}\}$ as the 
working set. As pre-processing, these frames are roughly 
aligned according to the transform matrix learnt using 
the method in Section 4.2. Note that there is a dilemma in 
selecting a proper value of parameter $M$. A large $M$ tends to 
convey more useful information yet complicates frame 
alignment. In practice we find that setting $M = 7$ is 
proper for most video sequences. 

Fig. 7. Exemplar results of the proposed alignment algorithm. The 
spatial-temporal cube is centered at the 25-th frame in clip "Football" 
(the original frame is seen in Fig. 5). Panels (a) and (b) plot the slices along 
a specific horizontal (y = 100) or vertical (x = 100) position respectively. 
The three slices from top to bottom show: those before sub-pixel 
alignment, aligned fence-ness values, and those after sub-pixel accurate 
alignment. 

Our initial study reveals the inadequacy of assuming a 
holistic or block-varying geometric transform (e.g., affine 
or projective), which tends to produce over-blurry results 
and discard thin scene objects like flagpoles due to the 
accumulated misalignment within the $2M + 1$ frames. In 
practice sub-pixel accuracy is required to guarantee the 
performance. However, the algorithm will fail if 
the fences are also aligned, which nullifies the temporal 
cue used for pixel restoration. As stated above, fence and 
scene pixels are distinguished according to their motion 
magnitudes, i.e., the parallax. An ideal image alignment 
is expected to take parallax into account. 

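To address the above issues, we propose a weighted, 
truncated optical flow method. The objective function 
to be minimized can be expressed as: 

$$J(u, v) = \int_p (1 - f(p)) \cdot \psi\left(|F(p + w) - F(p)|\right) + \alpha \cdot f(p) \cdot \phi\left(|\nabla u|^2 + |\nabla v|^2\right) \tag{5}$$

under the following constraints: 

$$-\theta_u \le u \le \theta_u, \tag{6}$$

$$-\theta_v \le v \le \theta_v. \tag{7}$$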

PoF values $\{f(p)\}$ are utilized to suppress fence-like 
pixels. Parameters $\theta_u$, $\theta_v$ are used to truncate salient 
motions that probably come from fences. The optimization described 
in (5) iterates between $F^t$ and each of its $2M$ temporally-adjacent 
frames. To compensate for accumulated misalignment 
between $F^t$ and $F^{t \pm k}$, we empirically find that 
$\theta_u = \theta_v = 1.5k + 1$ works well on most videos. See Fig. 7 
for an example. 

Fig. 8. The close-up views of a restored frame under different $\eta$. It is 
observed that a high $\eta$ tends to distort local image structures. 

Recall that $\theta_u$, $\theta_v$ tend to be loose bounds. To prevent 
large motions from distorting the overall motion field, another 
form of motion truncation is executed after computing 
the motion vectors. Specifically, pixels lie on different 
scan lines along the principal motion direction. The 
motion means and standard deviations along each scan 
line are calculated, denoted as $\mu_m$ and $\sigma_m$ respectively. 
Afterwards all motions on this line are truncated to be 
within $[\mu_m - \eta \cdot \sigma_m, \mu_m + \eta \cdot \sigma_m]$. We empirically find 
that a small $\eta$ (e.g., 0.1) produces reasonable results. See 
Fig. 8 for an example. 
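The two truncation steps can be sketched as follows. The sketch makes simplifying assumptions that are not in the text: the box bounds are applied as a post-hoc clip rather than as constraints inside the flow optimization, the principal motion direction is taken to be horizontal so that scan lines coincide with image rows, and only the horizontal component is truncated per scan line. The function name `truncate_flow` is hypothetical.

```python
import numpy as np

def truncate_flow(u, v, k, eta=0.1):
    """Truncate a flow field between a frame and its k-th temporal neighbor.

    u, v : (H, W) horizontal and vertical flow components.
    k    : temporal offset between the two frames (k >= 1).
    eta  : width of the per-scan-line band in standard deviations.
    Illustrative sketch only; see the assumptions stated in the text.
    """
    # Loose box bounds that grow with the temporal offset.
    theta = 1.5 * k + 1.0
    u = np.clip(u, -theta, theta)
    v = np.clip(v, -theta, theta)
    # Per-scan-line statistics (rows, assuming horizontal principal motion).
    mu = u.mean(axis=1, keepdims=True)
    sd = u.std(axis=1, keepdims=True)
    # Keep each motion within [mu - eta * sd, mu + eta * sd] on its scan line.
    u = np.clip(u, mu - eta * sd, mu + eta * sd)
    return u, v
```

A small `eta` forces each scan line towards a nearly uniform motion, which is consistent with the observation that large values distort local structures.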

5.2 Robust Temporal Median Filter (R-TMF) 

In the aligned $h \times w \times (2M + 1)$ spatial-temporal cube, 
at most $2M + 1$ pixels reside on each line orthogonal 
to the image plane. A direct solution to pixel restoration 
is to apply the naive temporal median filter (N-TMF) 
to such a pixel ensemble to resist the detrimental fence 
pixels. Unfortunately, reliable restoration via N-TMF 
requires that the "correct" pixels dominate, 
which makes N-TMF unstable and causes it to fail in several cases, 
including small parallax, thick fences along the principal 
motion direction and an under-estimated parameter $M$. 

To address the above issues, we propose the so-called 
robust temporal median filter (R-TMF), which can 
survive even when corrupted pixels dominate. The 
key idea is to weight the pixels with estimated PoF 
confidences, such that fence pixels are suppressed to gain 
more robustness. Moreover, the global luminance and 
chromatic statistics may be inconsistent between consecutive 
frames, partially owing to automatic camera white 
balancing and environmental lighting changes. N-TMF 
is known to be sensitive under this condition. In contrast, 
R-TMF implicitly models such inter-frame alteration based 
on a low-rank matrix prior. 

Fig. 9. Illustration of the basic idea of R-TMF.

Formally, given a pixel collection $\{x_t\}$, robust estimator theory has
established the equivalence between the output of N-TMF and the minimizer
$\arg\min_{\mu} \sum_t \|x_t - \mu\|_1$. Our proposed R-TMF extends this
observation to the vector case, wherein any $x_t$ is a vector rather than a
scalar as in N-TMF. In the current context, $x_t$ denotes any equivalent
vector representation of the original image matrix (or tensor for
multi-channel images).
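The equivalence between the median and the $\ell_1$ minimizer can be checked numerically for the scalar case. The sample values and the search grid below are illustrative assumptions.

```python
import numpy as np

# Five temporal samples of one pixel: three clean values and two fence-like outliers.
samples = np.array([0.31, 0.29, 0.30, 0.95, 0.97])
candidates = np.linspace(0.0, 1.0, 1001)      # brute-force grid over mu

diff = samples[None, :] - candidates[:, None]
l1_cost = np.abs(diff).sum(axis=1)            # sum_t |x_t - mu|
l2_cost = (diff ** 2).sum(axis=1)             # sum_t (x_t - mu)^2, for comparison

l1_minimizer = candidates[np.argmin(l1_cost)]  # coincides with the median
l2_minimizer = candidates[np.argmin(l2_cost)]  # coincides with the mean
```

The $\ell_1$ minimizer stays at a clean sample, whereas the squared-error minimizer (the mean) is dragged toward the outliers.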
We use the notation $X = [x_1, \ldots, x_{2M+1}]$ to represent the 2-D data
matrix, piling all $x_t$ as its column vectors. Mathematically, the goal of
R-TMF is to decompose $X$ into two additive components, motivated by the
robust principal component analysis (R-PCA) framework in [24]:

$$\min_{A,E} \|A\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad X = A + E, \qquad (8)$$

where $\|\cdot\|_*$ denotes the matrix nuclear norm, returning the sum of the
singular values of a matrix. It is widely used as a convex surrogate for the
non-smooth matrix rank. $\|\cdot\|_1$ is the matrix $\ell_1$-norm, returning
the sum of the absolute values of all matrix elements. Analogously, R-TMF
generalizes the scalar mean in N-TMF to the matrix nuclear norm (both
encourage simplicity), and the scalar absolute value to the matrix
$\ell_1$-norm (both are robust to extremely large outliers).

To better clarify the intuition underlying R-TMF, it is possible to factorize
the low-rank component in (8) as $A = PQ^\top$, where
$P = [p_1, \ldots, p_r]$, $Q = [q_1, \ldots, q_r]$ (assume
$\mathrm{rank}(A) = r$ without loss of generality). In the context of video
de-fencing, $P$ is comprised of image bases and $Q$ conveys information such
as camera parameters and environmental conditions. Fig. 9 illustrates it on
toy data, where aligned images $X = [I^1, \ldots, I^n]$ are corrupted by a
moving shining blob. By optimizing (8), the uncorrupted images and blobs can
be separated. Moreover, we further assume the images undergo a continuous
luminance attenuation parameterized by $I^k = c^{k-1} I^1$ ($0 < c < 1$). In
the ideal case, it is possible to find a factorization such that $P = I^1$,
$Q = (c^0, c^1, \ldots, c^{n-1})^\top$, and
$\mathrm{rank}(A) = \mathrm{rank}(PQ^\top) = 1$.
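The rank-1 structure under luminance attenuation can be verified on synthetic data. The image size, the number of frames, the attenuation factor and the blob location below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_frames, c = 50, 7, 0.9

base = rng.random(n_pixels)                   # vectorized first image, P = I^1
gains = c ** np.arange(n_frames)              # Q = (c^0, c^1, ..., c^{n-1})
A = np.outer(base, gains)                     # low-rank component A = P Q^T

E = np.zeros_like(A)
E[10:15, 3] = 1.0                             # sparse "blob" in a single frame
X = A + E                                     # observed data matrix

rank_A = np.linalg.matrix_rank(A)             # 1: every column is a scaled copy of base
rank_X = np.linalg.matrix_rank(X)             # the blob raises the rank
nuclear_A = np.linalg.norm(A, ord="nuc")      # sum of singular values of A
l1_E = np.abs(E).sum()                        # sum of absolute entries of E
```

For a rank-1 matrix the nuclear norm reduces to the product of the two vector norms, which is why the clean component is cheap under the objective in (8) while the blob is cheap under the $\ell_1$ term.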

The formulation in (8) degrades to a median filter when $\lambda \to \infty$.
As a 2D extension of N-TMF, it still tends to fail under heavy outliers,
which are common in the video de-fencing context (i.e., fence pixels dominate
along many spatial-temporal directions in the aligned $2M+1$ frames).
Directly solving (8) rarely yields satisfactory results. A straightforward
remedy is to explicitly specify a weight for each element of the matrix $X$,
so that the adverse effect of the outliers is mitigated. We choose the PoF
information for this purpose, and enhance (8) to obtain its weighted version:

$$\min_{A,E} \|A\|_* + \lambda \|W \odot E\|_1 \qquad (9)$$

$$\text{s.t.} \quad W \odot X = W \odot A + W \odot E, \qquad (10)$$

where $\odot$ is the element-wise matrix product and $W$ represents the
weight matrix with its $(k,p)$-th entry equal to the PoF-induced value
$1 - f(p)$ in frame $k$. The problem in (9) is convex, so its global optimum
can be efficiently pursued by convex solvers. We propose to utilize an
efficient optimization algorithm which capitalizes on augmented Lagrange
multipliers (ALM) [32].

For completeness, we first briefly introduce the basics of ALM and then
sketch the algorithmic pipeline. In [32], the general method of augmented
Lagrange multipliers is introduced for solving constrained optimization
problems of the kind:

$$\min f(X), \quad \text{s.t.} \quad h(X) = 0, \qquad (11)$$

where $f : \mathbb{R}^n \to \mathbb{R}$ and
$h : \mathbb{R}^n \to \mathbb{R}^m$ are both convex functions. Solving an
unconstrained optimization problem is typically much easier. To this end, in
ALM we instead optimize the following augmented Lagrangian function:

$$\mathcal{L}(X, Y, \mu) = f(X) + \langle Y, h(X) \rangle + \frac{\mu}{2} \|h(X)\|_F^2, \qquad (12)$$

where $\mu$ is a positive scalar, controlling the strength of the original
constraints in (11). Each iteration optimizes the augmented Lagrangian
function and passes the updated $X$, $Y$ values as the initialization of the
next iteration. The initial value of $\mu$ is exponentially increased until
the constraints finally rigidly hold.
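The generic ALM loop can be illustrated on a problem small enough to solve by hand. The sketch below is not the R-TMF solver: it assumes the toy objective $f(x) = \frac{1}{2}\|x - p\|^2$ with a single linear constraint $h(x) = a^\top x - b$, chosen because the inner minimization is a linear solve and the exact answer is the projection of $p$ onto a hyperplane. The function name and the values of $\rho$ and $\mu_{\max}$ are illustrative.

```python
import numpy as np

def alm_equality_projection(p, a, b, mu=1.0, rho=1.5, mu_max=1e6, iters=50):
    """ALM for min 0.5*||x - p||^2  s.t.  a^T x - b = 0 (toy problem)."""
    n = p.size
    x, y = p.copy(), 0.0
    for _ in range(iters):
        # Minimize L(x, y, mu) = f(x) + y*h(x) + (mu/2)*h(x)^2 over x.
        # Setting the gradient to zero gives (I + mu*a*a^T) x = p - y*a + mu*b*a.
        x = np.linalg.solve(np.eye(n) + mu * np.outer(a, a), p - y * a + mu * b * a)
        h = a @ x - b                 # constraint violation
        y = y + mu * h                # multiplier update
        mu = min(rho * mu, mu_max)    # exponentially increase the penalty
    return x, y

p = np.array([3.0, -1.0, 2.0])
a = np.array([1.0, 2.0, -1.0])
b = 4.0
x_alm, y_alm = alm_equality_projection(p, a, b)
x_exact = p - a * (a @ p - b) / (a @ a)   # closed-form projection onto the hyperplane
```

The multiplier update and the growing penalty drive the constraint violation to zero, and the iterate agrees with the closed-form projection.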

Unfortunately, directly applying the aforementioned technique to (9) fails to
reduce the optimization effort. The resultant augmented Lagrangian function
has no closed-form update for the variable $A$, mainly due to the matrix
nuclear norm $\|A\|_*$. Therefore we further relax (9) by introducing another
auxiliary variable $Z$, writing $D$ for the data matrix $X$, as follows:

$$\min \|A\|_* + \lambda \|W \odot E\|_1 \qquad (13)$$

$$\text{s.t.} \quad W \odot D = W \odot Z + W \odot E, \qquad (14)$$

$$A = Z. \qquad (15)$$

The augmented Lagrangian function for (13) can be accordingly represented as:

$$\mathcal{L}(A, E, Y, Z, \mu) = \|A\|_* + \lambda \|W \odot E\|_1 + \langle Y_1, W \odot (D - Z - E) \rangle + \langle Y_2, A - Z \rangle + \frac{\mu}{2} \|W \odot (D - Z - E)\|_F^2 + \frac{\mu}{2} \|A - Z\|_F^2, \qquad (16)$$

which is convex with respect to the variables to be optimized. Here we adopt
an alternating minimization strategy. Each optimization iteration consists of
four steps. In each step, a variable is updated in closed form with the
others fixed.

Fig. 10. Comparison of different estimators on the "WinterPalace" video
sequence. Panels include (d) mask-based image completion and (e) our proposed
method.

Step-I: Update A. The objective function in this step can be described as
below:

$$\arg\min_{A} \frac{1}{\mu} \|A\|_* + \frac{1}{2} \left\| A - (Z - Y_2/\mu) \right\|_F^2. \qquad (17)$$

The optimal solution is achieved in closed form based on the Singular Value
Decomposition (SVD) of the matrix $Z - Y_2/\mu$. Given a matrix of size
$m \times n$, it is well known that full SVD has the complexity of
$O(\min(mn^2, m^2n))$. See Appendix for details.

Step-II: Update Z. By setting the first-order derivative of (16) with respect
to $Z$ to zero, it is possible to obtain the following updating rule:

$$W^2 \odot Z + Z = W^2 \odot (D - E) + A + \frac{1}{\mu} (W \odot Y_1 + Y_2), \qquad (18)$$

where $W^2$ abbreviates $W \odot W$. It is trivially observed that the
optimal $Z$ is obtained from direct element-by-element matrix arithmetic
operations such as multiplication and division.

Step-III: Update E. For clarity, denote $\tilde{E} = W \odot E$,
$\tilde{D} = W \odot D$ and $\tilde{Z} = W \odot Z$. It is trivial to obtain
the optimal $E$ from the optimal $\tilde{E}$, therefore we only show the
objective function with respect to $\tilde{E}$ as below:

$$\arg\min_{\tilde{E}} \frac{\lambda}{\mu} \|\tilde{E}\|_1 + \frac{1}{2} \left\| \tilde{E} - (\tilde{D} - \tilde{Z} + Y_1/\mu) \right\|_F^2. \qquad (19)$$

The above optimization problem also has a closed-form solution with linear
complexity. Refer to Appendix for details.

Step-IV: Update Y. The Lagrangian variables are routinely updated as follows:

$$Y_1 = Y_1 + \mu (\tilde{D} - \tilde{Z} - \tilde{E}), \qquad (20)$$

$$Y_2 = Y_2 + \mu (A - Z). \qquad (21)$$

At the end of each iteration, the value of $\mu$ is increased (e.g.,
$\mu = \min(\rho\mu, \mu_{\max})$) to tighten the constraints, where
$\rho > 1$ is a free parameter. The optimization procedure terminates when
the gain of the objective function is tiny enough, i.e., when the changes
between consecutive estimates, measured by $\|A^k - A^{k-1}\|_F$ and
$\|E^k - E^{k-1}\|_F$, drop below $\epsilon$, (22)

where $A^k$, $E^k$ denote the estimates of $A$, $E$ at the $k$-th iteration
respectively, $\|\cdot\|_F$ is the Frobenius norm, and $\epsilon$ is a
pre-specified threshold (fixed to a small negative power of 10 in our
implementation).
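The four steps can be assembled into a compact NumPy sketch. It is a minimal illustration under stated assumptions rather than our implementation: all weights are strictly positive so that $E$ can be recovered from $W \odot E$ by division, the initial $\mu$ follows a common heuristic, the defaults for $\rho$, $\mu_{\max}$ and the tolerance are illustrative, and the stopping rule additionally checks the constraint residuals. The toy data (a rank-1 scene with luminance attenuation, a drifting fence of value 1.0, and weight 0.05 on fence pixels) is synthetic.

```python
import numpy as np

def soft_threshold(x, eps):
    """Elementwise shrinkage toward zero by eps."""
    return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)

def singular_value_threshold(m, eps):
    """Soft-threshold the singular values of m."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return (u * soft_threshold(s, eps)) @ vt

def weighted_rtmf(D, W, lam, rho=1.1, mu_max=1e7, tol=1e-7, max_iter=500):
    """Alternating updates for the relaxed weighted problem; W must be > 0."""
    A, E, Z = np.zeros_like(D), np.zeros_like(D), np.zeros_like(D)
    Y1, Y2 = np.zeros_like(D), np.zeros_like(D)
    W2 = W * W
    mu = 1.25 / np.linalg.norm(D, 2)          # heuristic start value (assumption)
    d_norm = np.linalg.norm(D)
    for _ in range(max_iter):
        A_prev, E_prev = A, E
        A = singular_value_threshold(Z - Y2 / mu, 1.0 / mu)             # Step-I
        Z = (W2 * (D - E) + A + (W * Y1 + Y2) / mu) / (W2 + 1.0)       # Step-II
        E = soft_threshold(W * (D - Z) + Y1 / mu, lam / mu) / W        # Step-III
        Y1 = Y1 + mu * W * (D - Z - E)                                  # Step-IV
        Y2 = Y2 + mu * (A - Z)
        mu = min(rho * mu, mu_max)
        residual = max(np.linalg.norm(W * (D - Z - E)), np.linalg.norm(A - Z))
        change = max(np.linalg.norm(A - A_prev), np.linalg.norm(E - E_prev))
        if max(residual, change) < tol * d_norm:
            break
    return A, E

# Synthetic aligned frames: rank-1 scene with attenuation plus a drifting fence.
rng = np.random.default_rng(0)
n_pixels, M, c = 50, 4, 0.95
n_frames = 2 * M + 1
base = rng.uniform(0.2, 0.6, n_pixels)
A_true = np.outer(base, c ** np.arange(n_frames))
D, W = A_true.copy(), np.ones_like(A_true)
for k in range(n_frames):
    D[20 + k:26 + k, k] = 1.0     # fence: 6 pixels thick, drifting 1 pixel per frame
    W[20 + k:26 + k, k] = 0.05    # weight 1 - f(p), with PoF f = 0.95 on the fence

A_hat, E_hat = weighted_rtmf(D, W, lam=1.0 / np.sqrt(n_pixels))
target = M                        # index of the target (middle) frame
err_rtmf = np.abs(A_hat[:, target] - A_true[:, target]).max()
err_median = np.abs(np.median(D, axis=1) - A_true[:, target]).max()
```

In this toy setting some pixels are covered by the fence in six of the nine frames, so the naive temporal median returns the fence value there, while the weighted low-rank estimate can fill those pixels from the rank-1 structure of the remaining entries.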

6 Experiments 
6.1 Dataset Description 

Fig. 11. The video clips used in our evaluations, including "Pole", "Tennis"
and "WestCoast".

To evaluate the proposed algorithm, we capture nine video clips using a Kodak
Z650 camera. All of the scenes are in Asia (Beijing or Singapore). Fig. 11
shows the exemplar frames from these video clips. Most of the video clips
have a short duration (from 2 seconds to 8 seconds) and are generally
captured following the rules described in Section 3. We generalize the term
"fence" to anything that is distant from the target scene, for example, the
pole in the video "Pole" or the tree in the video "Temple". Obviously such
objects can hardly be epitomized by any kind of image regularity, therefore
frustrating
the methods in [1], [6].

6.2 Investigation on R-TMF 

In Fig. 10 we compare the resultant quality of different estimators on
aligned frames, together with the result obtained by the built-in image
completion utility in Photoshop. The superior quality of our proposed method
demonstrates the effectiveness of temporal information in video de-fencing,
and also highlights the necessity of data weighting for robust estimation.

6.3 More Results of Restored Frames 



Fig. 12 presents more restored frames for the adopted dataset, where the
first three rows display the original video frames, the estimated
probability-of-fence values (a linear combination of the motion cue and the
appearance cue with equal weights), and the restored frames using the
proposed method. The adopted video clips cover a large spectrum of real-world
scenes, which demonstrates the effectiveness of our proposed framework.

Note that the proposed algorithm has various parameters in different stages,
for example, the parameters to estimate the optical flow field and the
constant used in Eqn. (3). Empirically we find that the final results are
quite stable over most of the parameters. The only exception is the parameter
$M$, which controls the number of consecutive frames used for pixel
restoration. As illustrated in Fig. 4, the optimal value of $M$ is related to
several factors, including the spatial extent of the fence-like objects and
the motion speed. When the "fence" has a wide span (e.g., the tree in the
video clip "Temple"), the algorithm can still recover the occluded pixels.
However, it requires a larger $M$ (i.e., more frames must be aligned to the
target frame; see Section 5.1). For the video clip "Temple", we set $M = 13$
(by default $M = 7$ for the others) to obtain the restored results in
Fig. 12.

Fig. 13. A failure case caused by the inconsistency of fence geometry and
camera motion: (a) frame #22, (b) estimated PoF values, (c) restored frame.
The camera undergoes a horizontal move, which makes the horizontal fences
impossible to identify.

A larger $M$ is a double-edged sword. It enables the recovery of more pixels,
and simultaneously complicates accurate alignment of all $2M + 1$ frames.
Fig. 12 displays a local image region where heavy blur is observed. Note that
the depth of the enlarged region is very close to that of the "fence". Since
the image is mainly aligned with respect to the target scene, the
misalignment for the enlarged region is understandably increased.

The only failure case among the nine video clips is the one named "Square".
As shown in Fig. 13, the algorithm fails to recover the horizontal bars. This
mainly results from the inconsistency between fence geometry and camera
motion, which violates the hypothesis presented in Section 5. However, as
shown on the other video clips, when those hypotheses are satisfied, the
algorithm works reasonably well.

6.4 Complexity 

Regarding computational efficiency, the most time-consuming operations are
the frame alignment in Section 5.1 and the R-TMF estimator. Our captured
videos all have a resolution of 480 x 360 pixels and contain roughly 40
consecutive frames. The optical flow computation between two frames takes
roughly 2.2 seconds on our desktop computer equipped with 8 GB of memory and
an Intel Q9559 CPU. Overall $2M$ (typically $M = 7$) optical flow
optimizations are involved during frame alignment, which indicates a rough
time cost of 30 seconds. Another 40 seconds are spent on the R-TMF estimator.
Generally the restoration of each single frame is accomplished within 80
seconds.
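The per-frame budget follows from simple arithmetic on the figures above. The variable names below are illustrative.

```python
M = 7                                   # default temporal radius
seconds_per_flow = 2.2                  # one optical flow computation
rtmf_seconds = 40.0                     # R-TMF estimation per restored frame

alignment_seconds = 2 * M * seconds_per_flow      # 14 flow computations
total_seconds = alignment_seconds + rtmf_seconds  # alignment plus R-TMF
```

Fourteen flow computations account for about 31 seconds, so alignment plus R-TMF stays below the stated bound of 80 seconds per frame.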

7 Conclusions, Limitations and Future Perspective

In this paper we present a new research topic, the so-called video
de-fencing, and propose a framework based on parallax-aware sub-pixel frame
alignment and robust pixel restoration. The current solution focuses on
videos with static scenes. We evaluate the proposed method on a set of
real-world consumer videos and generate promising results on most of them.
The proposed robust temporal median filter (R-TMF) is a general tool that can
be applied in numerous applications.

The main limitation of our work is its inability to handle moving objects,
since a moving object disrupts the estimation of depth from pixel
displacement across the video frames. Likewise, the proposed method has
difficulty when the background scene and the fence-like objects have similar
depths. Another downside of the proposed method is the requirement on video
capture introduced in Section 3.

Regarding the future work, the proposed framework is 
expected to be extended along the following directions: 

• Extension to dynamic scenes. They are challenging, since it is hard to
tell whether parallax is caused by depth discrepancy or by object motion.
With high probability the moving objects will be judged as "fence" and
removed. To disambiguate these two kinds of parallax, additional cues will
be used. For example, it can be assumed that pixels from the fence
homogeneously follow a specific appearance model (e.g., a color-based
Gaussian mixture model), which excludes the pixels from moving objects.
Another possible solution is manually specifying the mask of the fence at
several key-frames.

• Enhanced sub-pixel image alignment with addi- 
tional constraints. Conventional optical flow meth- 
ods seldom target those scenes with diverse depth 
as in the application of video de-fencing. Additional 
constraints (e.g., SIFT point correspondence) can be 
further incorporated to enhance the accuracy of 
image alignment. Moreover, it is also helpful to 
develop a user-friendly interface that takes users 
into the loop. 

• Public benchmark for quantitative comparison. In 
this work we construct a comprehensive video set 
for qualitative evaluations. However, it is difficult 
to obtain the ground truth videos (i.e., videos cap- 
tured under the same settings yet without fence-like 
objects) for such real-world videos. In the future 
we will try to establish a public benchmark from 
artificially fenced sequences for comparing different 
algorithms developed for the video de-fencing task. 

Appendix 

In this section we introduce two optimization problems mentioned in
Section 5.2. First we define the following soft-thresholding operator:

$$S_\epsilon[x] = \begin{cases} x - \epsilon, & \text{if } x > \epsilon \\ x + \epsilon, & \text{if } x < -\epsilon \\ 0, & \text{otherwise} \end{cases} \qquad (23)$$

The effect of $S_\epsilon[x]$ is shrinking $x$ towards zero, controlled by
the parameter $\epsilon$. Based on the above operator, it is possible to
pursue the closed-form solutions for the following optimization
problems [33]:

$$U S_\epsilon[S] V^\top = \arg\min_X \epsilon \|X\|_* + \frac{1}{2} \|X - W\|_F^2, \qquad (24)$$

$$S_\epsilon[W] = \arg\min_X \epsilon \|X\|_1 + \frac{1}{2} \|X - W\|_F^2, \qquad (25)$$

where $W = U S V^\top$ denotes the SVD of $W$.
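Both closed forms can be checked numerically by comparing the objective value at the closed-form solution against randomly perturbed candidates. The matrix size, the value of $\epsilon$ and the perturbation scale below are illustrative assumptions.

```python
import numpy as np

def soft_threshold(x, eps):
    """Eq. (23), applied elementwise: shrink x toward zero by eps."""
    return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)

def svt(w, eps):
    """Eq. (24): soft-threshold the singular values of w."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return (u * soft_threshold(s, eps)) @ vt

def nuclear_objective(x, w, eps):
    return eps * np.linalg.norm(x, "nuc") + 0.5 * np.linalg.norm(x - w, "fro") ** 2

def l1_objective(x, w, eps):
    return eps * np.abs(x).sum() + 0.5 * np.linalg.norm(x - w, "fro") ** 2

rng = np.random.default_rng(0)
w, eps = rng.standard_normal((6, 4)), 0.5
x_nuc, x_l1 = svt(w, eps), soft_threshold(w, eps)

# No random perturbation of the closed-form solutions should lower the objective.
perturbations = [0.1 * rng.standard_normal(w.shape) for _ in range(100)]
nuc_is_min = all(nuclear_objective(x_nuc, w, eps)
                 <= nuclear_objective(x_nuc + d, w, eps) + 1e-12 for d in perturbations)
l1_is_min = all(l1_objective(x_l1, w, eps)
                <= l1_objective(x_l1 + d, w, eps) + 1e-12 for d in perturbations)
```

The elementwise operator costs time linear in the number of entries, while the singular-value version is dominated by the cost of the SVD.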